Cleaning the weather dataset

In this notebook, we’ll be using numpy and pandas.

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

Numpy is the fundamental package for scientific computing with Python.

Let’s install the packages pandas and numpy.

## Uncomment and run this cell to install pandas and numpy
#!pip install pandas numpy
# import the libraries
import pandas as pd
import numpy as np
from dataidea.datasets import loadDataset

Let’s check the versions of python, numpy and pandas we’ll be using for this notebook

# checking python version
print('Python Version: ',)
!python --version
Python Version: 
Python 3.10.12
# Checking numpy and pandas versions
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)
Pandas Version:  2.2.1
Numpy Version:  1.26.4

Let’s load the dataset. We’ll be using a weather dataset that imagined for learning purposes.

# load the dataset
weather_data = loadDataset('weather')

We can sample out random rows from the dataset using the sample() method, we can use the n parameter to specify the number of rows to sample

# sample out random values from the dataset
weather_data.sample(n=5)
day temperature windspead event
2 05/01/2017 28.0 NaN Snow
8 11/01/2017 40.0 12.0 Sunny
7 10/01/2017 34.0 8.0 Cloudy
1 04/01/2017 NaN 9.0 Sunny
4 07/01/2017 32.0 NaN Rain

Display some info about the dataset eg number of entries, count of non-null values and variable datatypes using the info() method

# get quick dataframe info
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          9 non-null      object 
 1   temperature  5 non-null      float64
 2   windspead    5 non-null      float64
 3   event        7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes

We can count all missing values in each column in our dataframe by using dataframe.isna().sum(), eg

# count missing values in each column
weather_data.isna().sum()
day            0
temperature    4
windspead      4
event          2
dtype: int64

We can use a boolean-indexing like technique to find all rows in a dataset with missing values in a specific column.

# get rows with missing data in temperature
weather_data[weather_data.temperature.isna()]
day temperature windspead event
1 04/01/2017 NaN 9.0 Sunny
3 06/01/2017 NaN 7.0 NaN
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN NaN
# get rows with missing data in event column
weather_data[weather_data.event.isna()]
day temperature windspead event
3 06/01/2017 NaN 7.0 NaN
6 09/01/2017 NaN NaN NaN

For the next part, we would like to demonstrate forward fill (ffill()) and backward fill (bfill), we first create two copies of the dataframe to avoid modifying our original copy in memory. - ffill() fills the missing values with the previous valid value in the column - bfill() fills the missing values with the next valid value in the column

# Create copies of a dataframe
weather_data1 = weather_data.copy()
weather_data2 = weather_data.copy()
# fill with the previous valid value
weather_data1['event'] = weather_data1.event.ffill()
weather_data1
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 Snow
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN Sunny
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 NaN
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN NaN
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
# fill with the next valid value in the column
weather_data2['event'] = weather_data2.event.bfill()
weather_data2
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 Rain
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN Cloudy
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

We can modify (or fill) a specific value in the dataframe by using the loc[] method. This picks the value by its row (index) and column names. Assigning it a new value modifies it in the dataframe as illustrated below

# modify a specific value in the dataframe
weather_data2.loc[1, 'temperature'] = 29
weather_data2
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 29.0 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 Rain
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN Cloudy
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

We can use the fillna() method to replace all missing values in a column with a specific value as demostrated value

# replace missing values in temperature column with mean
weather_data2['temperature'] = weather_data2.temperature.fillna(
    value=weather_data2.temperature.mean()
)
weather_data2
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 29.0 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 32.5 7.0 Rain
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 32.5 NaN Sunny
6 09/01/2017 32.5 NaN Cloudy
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
# create a copy of weather_data2
weather_data22 = weather_data2.copy()
# Replace missing values in windspead column with a specific value
weather_data22['windspead'] = weather_data2.windspead.fillna(value=7.5)
weather_data22
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 29.0 9.0 Sunny
2 05/01/2017 28.0 7.5 Snow
3 06/01/2017 32.5 7.0 Rain
4 07/01/2017 32.0 7.5 Rain
5 08/01/2017 32.5 7.5 Sunny
6 09/01/2017 32.5 7.5 Cloudy
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

We can also use the fillna() method to fill missing values in multiple columns by passing in the dictionary of key/value pairs of column-name and value to replace. Before we demonstrate this, let’s create a copy of the dataframe to avoid modifying the original in memory

# create a copy of the weather_data dataframe
weather_data3 = weather_data.copy()
# Replace missing values in temperature, column and event
weather_data3.fillna(value={
    'temperature': weather_data3.temperature.mean(), 
    'windspead': weather_data3.windspead.max(), 
    'event': weather_data3.event.bfill()
    }, inplace=True)
weather_data3
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 33.2 9.0 Sunny
2 05/01/2017 28.0 12.0 Snow
3 06/01/2017 33.2 7.0 Rain
4 07/01/2017 32.0 12.0 Rain
5 08/01/2017 33.2 12.0 Sunny
6 09/01/2017 33.2 12.0 Cloudy
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

We can optionally drop all rows with missing values using the dropna() method. Before we demonstrate that, let’s first create a dataframe copy to avoid modifying the original in the memory

# create a copy of weather_data dataframe
weather_data4 = weather_data.copy()
# Drop all rows with missing values
weather_data4.dropna()
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
Back to top